Introduction

According to the Centers for Disease Control and Prevention, in 2015-2016 more than 12% of adults aged 20 and above in the United States had cholesterol levels higher than 240 mg/dL. A high cholesterol level is one of the major risk factors for coronary heart disease, heart attack and stroke. Of the two main types of cholesterol, low-density lipoprotein (LDL) is known as the "bad" one because it leads to the buildup of cholesterol in the arteries. The goal of this project is to explore factors that can affect the level of LDL in the human body, and the results are intended to help readers gain insight into how to manage LDL levels in order to control or prevent cardiovascular disease.

The analysis result shows that …

Data

The data analysis is performed with the NHANES 2015-2016 data. The dependent variable, the level of low-density lipoprotein (LDL) in mg/dL, is selected from the Cholesterol - LDL & Triglycerides dataset of the laboratory data. Since a high triglyceride level is believed to be associated with a high LDL level, triglycerides, obtained from the same dataset, are included as an independent variable. We also include blood pressure readings from the Blood Pressure dataset of the examination data. According to the data description, some participants have multiple blood pressure readings; for simplicity, we use the averaged systolic and diastolic readings as each individual's blood pressure measurements. Average intakes of fat and cholesterol, computed from both the First and the Second Day Total Nutrient Intakes of the dietary data, are added as independent variables as well. To account for individual differences, we also include gender, race and age from the Demographics Data, and height, weight and BMI from Body Measures of the examination data as additional covariates. SEQN, the respondent sequence number, serves as the unique identifier for matching responses across datasets. Finally, after removing all rows containing missing values, a total of 2503 observations remain for analysis.
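In outline, the averaging and SEQN-based matching described above amount to the following (a minimal pandas sketch on invented values; the column names are schematic, not the actual NHANES variable codes):

```python
import pandas as pd
import numpy as np

# Toy stand-ins for two NHANES files (values invented for illustration)
bp = pd.DataFrame({'seqn': [1, 2, 3],
                   'sys1': [120.0, 140.0, np.nan],
                   'sys2': [124.0, np.nan, np.nan]})
ldl = pd.DataFrame({'seqn': [1, 2], 'ldl': [110.0, 160.0]})

# Average the repeated readings per person, ignoring missing ones
bp['systolic'] = bp[['sys1', 'sys2']].mean(axis=1)

# Match records on the unique identifier, then drop incomplete rows
data = (bp[['seqn', 'systolic']]
        .merge(ldl, on='seqn', how='inner')
        .dropna())
```

The inner merge keeps only respondents present in every file, which is also how respondents missing from the LDL dataset fall out of the final sample.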

Methods

We fit models using multiple linear regression and then perform model selection to choose the model that best describes the level of LDL. For the first model, we regress the dependent variable LDL on all predictors:

\(\mathbf{LDL}\) ~ \(\mathbf{age + race + gender + height + weight + BMI + fat + cholesterol + triglycerides + diastolic + systolic}\) (1)

Note that covariates gender and race are treated as categorical variables.

A plot of residuals against fitted values shows non-linearity, suggesting a transformation of the dependent variable.

Residual plot of the full model

With the help of the Box-Cox test:

Box-Cox Transformation plot

we identify that the square root transformation is the best choice.
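The logic of this choice can be illustrated by profiling the Box-Cox log-likelihood over a grid of \(\lambda\) values. Below is a minimal sketch using scipy on simulated data constructed so that the square root is the correct transformation (the data are invented, not the NHANES sample):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulate a positive response whose square root is normally distributed,
# so the profile log-likelihood should peak near lambda = 0.5
y = rng.normal(loc=20, scale=4, size=2000) ** 2

lambdas = np.linspace(0, 1, 101)
llf = [stats.boxcox_llf(lmb, y) for lmb in lambdas]
best = lambdas[int(np.argmax(llf))]
print(best)  # peaks near 0.5, i.e. the square-root transformation
```

The same profile, computed on the actual LDL values, is what the Box-Cox plot above summarizes.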

We then fit a new linear model with the transformed dependent variable, \(\sqrt{LDL}\):

\(\mathbf{\sqrt{LDL}}\) ~ \(\mathbf{age + race + gender + height + weight + BMI + fat + cholesterol + triglycerides + diastolic + systolic}\) (2)

As the model has as many as 11 covariates and some of them are not significant under the t-test, we use stepwise selection to choose the variables that best explain \(\sqrt{LDL}\).
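The stepwise procedure can be sketched as backward elimination by AIC. The following is a simplified numpy illustration on simulated data (real implementations, such as R's step(), also consider re-adding dropped terms):

```python
import numpy as np

def aic(y, X):
    """AIC of a Gaussian OLS fit, up to an additive constant."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n, k = X.shape
    return n * np.log(rss / n) + 2 * k

def backward_by_aic(y, X, names):
    """Repeatedly drop the predictor whose removal most lowers the AIC."""
    cols = list(range(X.shape[1]))
    current = aic(y, X[:, cols])
    while len(cols) > 1:
        trials = {j: aic(y, X[:, [c for c in cols if c != j]])
                  for j in cols[1:]}          # never drop the intercept
        j = min(trials, key=trials.get)
        if trials[j] >= current:
            break
        current = trials[j]
        cols.remove(j)
    return [names[c] for c in cols]

rng = np.random.default_rng(1)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))
y = 2 * x1 + 0.5 * rng.normal(size=n)        # only x1 truly matters
X = np.column_stack([np.ones(n), x1, x2, x3])
selected = backward_by_aic(y, X, ['intercept', 'x1', 'x2', 'x3'])
print(selected)  # keeps the intercept and x1; the noise terms usually drop
```

The AIC's 2k penalty is what trades goodness of fit against model size; BIC replaces 2 with log(n) and so prunes more aggressively.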

We also consider transformations of the predictors. Examining partial residual plots:

Partial Residual Plot

we find that the variables age and triglycerides violate the linearity assumption. Both plots exhibit a quadratic form, so in addition to the covariates in the full model (2) above, we add \(age^2\) and \(triglycerides^2\) to the regression model:

\(\mathbf{\sqrt{LDL}}\) ~ \(\mathbf{age + age^2 + race + gender + height + weight + BMI + fat + cholesterol + triglycerides + triglycerides^2 + diastolic + systolic}\) (3)
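The partial residual diagnostic that motivated the squared terms can be sketched on simulated data (a hypothetical illustration; all coefficients here are invented, not NHANES estimates):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
age = rng.uniform(20, 80, n)
other = rng.normal(size=n)
# True model is quadratic in age; we deliberately fit it as linear
y = 0.1 * age - 0.001 * age**2 + 0.5 * other + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), age, other])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Partial residuals for age: residuals plus age's fitted component.
# Plotted against age, these would show the quadratic bend.
partial = (y - X @ beta) + beta[1] * age

# A quadratic fit to the partial residuals recovers the missing curvature
quad = np.polyfit(age, partial, 2)
print(round(quad[0], 4))  # close to the true curvature of -0.001
```

This is the same signal crPlots() displays graphically: curvature in a partial residual plot indicates a missing nonlinear term for that predictor.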

As before, we run stepwise model selection to retain the most significant variables of this model.

The steps outlined above are carried out in R, Stata and Python. In R, we use the data.table package for data cleaning.

Core Analysis

R

Note that for the R code, we use the data.table package to clean and reshape the data; once the data are prepared, we turn to the analysis.

Data Cleaning

## Group Project HTML
## Author: Huayu Li, huayuli@umich.edu
## Updated: Dec. 8 2019

#### Data cleaning using data.table 

## Libraries: -------------------------------------------------------------------------
library(data.table)
library(foreign)
library(tidyverse) 

## 80: --------------------------------------------------------------------------------

## Read the datasets
demo=data.table(read.xport("./Original Data/DEMO_I.XPT.txt"))
tot1=data.table(read.xport("./Original Data/DR1TOT_I.XPT.txt"))
tot2=data.table(read.xport("./Original Data/DR2TOT_I.XPT.txt"))
b_pres=data.table(read.xport("./Original Data/BPX_I.XPT.txt"))
ldl=data.table(read.xport("./Original Data/TRIGLY_I.XPT.txt"))
measure=data.table(read.xport("./Original Data/BMX_I.XPT.txt"))

## For each dataset, choose the proper variables and make 
## some transformation.

# For demo dataset, we choose seqn, gender, age and race variables
Demo=demo[,.(seqn=SEQN,gender=as.factor(RIAGENDR),
             age=RIDAGEYR,race=as.factor(RIDRETH3))]

# For dietary data, we choose seqn, intake fat, intake cholesterol
# for each day. 
TOT1=tot1[,.(seqn=SEQN,intake_fat1=DR1TTFAT,
             intake_chol1=DR1TCHOL)]
TOT2=tot2[,.(seqn=SEQN,intake_fat2=DR2TTFAT,
             intake_chol2=DR2TCHOL)]

## Next we average the intakes over the two days before including
## them in the model. The averaging step is as follows:
intake_type=c('intake_fat1','intake_fat2','intake_chol1','intake_chol2')
TOT=TOT1%>%merge(.,TOT2,by='seqn',all=FALSE)
TOT=melt(TOT,measure=intake_type)[
    ,.(seqn,type=factor(variable,intake_type,c(rep('intake_fat',times=2),
                                               rep('intake_chol',times=2))),
       variable,value)  
    ][
      ,.(intake=mean(value,na.rm=TRUE)),by=.(seqn,type)
    ]
  
TOT=dcast(TOT,...~type,value.var=c('intake'))  

# For blood pressure, we choose seqn, systolic pressures and diastolic 
# pressures. We then use the average pressure as the final pressure.
pres_type=c(paste('sys',1:4,sep=''),paste('dia',1:4,sep=''))
B_pres=b_pres[,.(seqn=SEQN,sys1=BPXSY1,sys2=BPXSY2,sys3=BPXSY3,sys4=BPXSY4,
                 dia1=BPXDI1,dia2=BPXDI2,dia3=BPXDI3,dia4=BPXDI4)]
B_pres=melt(B_pres,measure=pres_type)[,
                                      .(seqn,type=factor(variable,pres_type,
                                                         c(rep('s',times=4),rep('d',times=4))),
                                        variable,pressure=value)
                                      ][
                                        ,.(pres=mean(pressure,na.rm=TRUE)),by=.(seqn,type)
                                        ]
B_pres=dcast(B_pres,...~type,value.var=c('pres'))[
  ,.(seqn,systolic=s,diastolic=d)
  ]

# For ldl dataset, we choose seqn, LDL-cholesterol and Triglyceride
# for mg/dL.
LDL=ldl[,.(seqn=SEQN,ldl=LBDLDL,triglycerides=LBXTR)]

# For body measure dataset, we choose weight height and bmi as our 
# variables.
Measure=measure[,.(seqn=SEQN,weight=BMXWT,height=BMXHT,bmi=BMXBMI)]


## Now merge the datasets into one, using seqn as the merge key.
## Respondents absent from the LDL dataset are dropped by the
## inner merges.

Data=Demo%>%merge(.,TOT,by='seqn',all=FALSE)%>%
  merge(.,B_pres,by='seqn',all=FALSE)%>%
  merge(.,LDL,by='seqn',all=FALSE)%>%
  merge(.,Measure,by='seqn',all=FALSE)%>%
  na.omit()

Models

### Using this file for regression: ---------------------------------------------------

## Libraries: -------------------------------------------------------------------------
library(lme4)
library(MASS)
library(car)

## 80: --------------------------------------------------------------------------------

## Remove the seqn variable, and set gender and race as factor variables
DT=Data[,.(gender=as.factor(gender),age,race=as.factor(race),intake_fat,intake_chol,
         systolic,diastolic,ldl,triglycerides,weight,height,bmi)]

## First of all, we will fit the model with all variables, and then give 
## the residual plot of the model.
L1=lm(ldl~.,data=DT)
summary(L1)
## 
## Call:
## lm(formula = ldl ~ ., data = DT)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -113.646  -22.757   -2.451   19.297  149.850 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   101.571467  44.874374   2.263   0.0237 *  
## gender2         1.171152   1.847729   0.634   0.5262    
## age             0.175782   0.038684   4.544 5.78e-06 ***
## race2           3.952555   2.414177   1.637   0.1017    
## race3           2.036991   2.093998   0.973   0.3308    
## race4           3.506045   2.285949   1.534   0.1252    
## race6           5.104577   2.738375   1.864   0.0624 .  
## race7           6.036927   3.880265   1.556   0.1199    
## intake_fat      0.005288   0.022726   0.233   0.8160    
## intake_chol    -0.001265   0.004397  -0.288   0.7736    
## systolic       -0.005915   0.045972  -0.129   0.8976    
## diastolic       0.401837   0.057452   6.994 3.41e-12 ***
## triglycerides   0.142759   0.011379  12.546  < 2e-16 ***
## weight          0.301651   0.266228   1.133   0.2573    
## height         -0.318111   0.269684  -1.180   0.2383    
## bmi            -0.595490   0.741391  -0.803   0.4219    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.34 on 2487 degrees of freedom
## Multiple R-squared:  0.1342, Adjusted R-squared:  0.129 
## F-statistic: 25.69 on 15 and 2487 DF,  p-value: < 2.2e-16
## Here we give the residual plot of the model
plot(L1$fitted.values,L1$residuals)

## The residual plot suggests a transformation of ldl, so here
## we run the Box-Cox procedure.
boxcox(L1,plotit=TRUE,lambda=seq(0,1,1/100))
## Here it seems that lambda=0.5 is the best choice, that is, to use sqrt(ldl).
## Here we make the transformation and then do the regression again.
L2=lm(sqrt(ldl)~.,data=DT)
summary(L2)
## 
## Call:
## lm(formula = sqrt(ldl) ~ ., data = DT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1905 -1.0601  0.0021  1.0178  5.5148 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.005e+01  2.158e+00   4.658 3.36e-06 ***
## gender2        5.965e-02  8.886e-02   0.671   0.5021    
## age            8.429e-03  1.860e-03   4.531 6.15e-06 ***
## race2          1.830e-01  1.161e-01   1.576   0.1151    
## race3          8.270e-02  1.007e-01   0.821   0.4116    
## race4          1.305e-01  1.099e-01   1.187   0.2354    
## race6          2.302e-01  1.317e-01   1.748   0.0806 .  
## race7          2.744e-01  1.866e-01   1.470   0.1416    
## intake_fat     4.259e-04  1.093e-03   0.390   0.6968    
## intake_chol   -8.613e-05  2.115e-04  -0.407   0.6838    
## systolic      -3.532e-04  2.211e-03  -0.160   0.8731    
## diastolic      2.006e-02  2.763e-03   7.261 5.11e-13 ***
## triglycerides  6.538e-03  5.472e-04  11.948  < 2e-16 ***
## weight         1.492e-02  1.280e-02   1.165   0.2440    
## height        -1.644e-02  1.297e-02  -1.268   0.2050    
## bmi           -2.700e-02  3.566e-02  -0.757   0.4489    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.604 on 2487 degrees of freedom
## Multiple R-squared:  0.1327, Adjusted R-squared:  0.1275 
## F-statistic: 25.38 on 15 and 2487 DF,  p-value: < 2.2e-16
## There are too many variables in the regression model, so here we will do
## the model selection and choose the variables. Here we do both the forward
## and backward selections.
L3=step(L2,direction='both',trace=FALSE)
summary(L3)
## 
## Call:
## lm(formula = sqrt(ldl) ~ age + diastolic + triglycerides + weight + 
##     height, data = DT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2425 -1.0646  0.0056  1.0406  5.5286 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.8488834  0.5771649  15.332  < 2e-16 ***
## age            0.0079968  0.0016065   4.978 6.87e-07 ***
## diastolic      0.0203065  0.0025879   7.847 6.28e-15 ***
## triglycerides  0.0064965  0.0005279  12.306  < 2e-16 ***
## weight         0.0049061  0.0017000   2.886  0.00394 ** 
## height        -0.0083689  0.0036375  -2.301  0.02149 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.603 on 2497 degrees of freedom
## Multiple R-squared:  0.1304, Adjusted R-squared:  0.1286 
## F-statistic: 74.87 on 5 and 2497 DF,  p-value: < 2.2e-16

From the output, we find that the variables age, diastolic, triglycerides, weight and height are selected, and all are significant under the t-test.

In this model, age, diastolic, triglycerides and weight are positively associated with the fitted \(\sqrt{ldl}\), while height is negatively associated: with the other variables fixed, a one-year increase in age raises \(\sqrt{ldl}\) by 0.008 units; a one-unit increase in diastolic pressure raises it by 0.02 units; a one-unit increase in triglycerides raises it by 0.006 units; a one-unit increase in weight raises it by 0.005 units; and a one-unit increase in height lowers it by 0.008 units. The \(R^2\) is 0.1304, and the residual standard error is 1.603.
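Because the model is fit on \(\sqrt{ldl}\), the implied change in ldl itself depends on the starting level. A quick illustration (the age coefficient is taken from the fitted summary above; the baseline LDL of 110 mg/dL is an invented example value):

```python
# Effect of 10 more years of age implied by the sqrt-scale model:
# add 10 * 0.008 on the square-root scale, then square back.
baseline_ldl = 110.0            # hypothetical starting LDL, mg/dL
b_age = 0.008                   # age coefficient from the fitted model
new_sqrt = baseline_ldl ** 0.5 + 10 * b_age
change = new_sqrt ** 2 - baseline_ldl
print(round(change, 2))  # about +1.68 mg/dL of LDL at this baseline
```

The same 0.08-unit change on the square-root scale translates into a larger LDL change at higher baselines, which is worth keeping in mind when reading the coefficients.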

## In the models above we did not consider transformations of the
## predictors; next we consider adding some nonlinear terms.
crPlots(L2,layout=c(4,3))
## The partial residual plots show that nonlinear terms are needed
## for triglycerides and age. We add the squared terms, and the
## regression result is as follows:
L4=lm(sqrt(ldl)~gender+age+race+intake_fat+intake_chol+systolic+diastolic+
             weight+height+bmi+triglycerides+I(triglycerides^2)+I(age^2),data=DT)
summary(L4)
## 
## Call:
## lm(formula = sqrt(ldl) ~ gender + age + race + intake_fat + intake_chol + 
##     systolic + diastolic + weight + height + bmi + triglycerides + 
##     I(triglycerides^2) + I(age^2), data = DT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2795 -0.9630 -0.0056  0.9958  5.4639 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.155e+01  2.065e+00   5.593 2.47e-08 ***
## gender2            -7.780e-02  8.572e-02  -0.908  0.36421    
## age                 1.075e-01  8.753e-03  12.282  < 2e-16 ***
## race2               8.510e-02  1.109e-01   0.767  0.44310    
## race3               1.954e-01  9.673e-02   2.020  0.04349 *  
## race4               2.324e-01  1.053e-01   2.207  0.02739 *  
## race6               1.303e-01  1.258e-01   1.035  0.30054    
## race7               3.578e-01  1.782e-01   2.008  0.04472 *  
## intake_fat         -7.455e-05  1.043e-03  -0.071  0.94303    
## intake_chol        -1.705e-04  2.019e-04  -0.844  0.39857    
## systolic            3.840e-03  2.141e-03   1.794  0.07298 .  
## diastolic           6.303e-03  2.858e-03   2.206  0.02750 *  
## weight              2.520e-02  1.223e-02   2.060  0.03950 *  
## height             -3.457e-02  1.246e-02  -2.775  0.00556 ** 
## bmi                -7.122e-02  3.413e-02  -2.087  0.03703 *  
## triglycerides       2.150e-02  1.660e-03  12.953  < 2e-16 ***
## I(triglycerides^2) -5.019e-05  5.005e-06 -10.027  < 2e-16 ***
## I(age^2)           -1.138e-03  9.593e-05 -11.862  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.53 on 2485 degrees of freedom
## Multiple R-squared:  0.2114, Adjusted R-squared:  0.206 
## F-statistic:  39.2 on 17 and 2485 DF,  p-value: < 2.2e-16
## Just the same, do the model selection.
L5=step(L4,direction='both',trace=FALSE)
summary(L5)
## 
## Call:
## lm(formula = sqrt(ldl) ~ age + systolic + diastolic + weight + 
##     height + bmi + triglycerides + I(triglycerides^2) + I(age^2), 
##     data = DT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2347 -0.9663 -0.0208  1.0166  5.4756 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.109e+01  2.011e+00   5.516 3.82e-08 ***
## age                 1.047e-01  8.550e-03  12.249  < 2e-16 ***
## systolic            4.101e-03  2.091e-03   1.961   0.0500 *  
## diastolic           6.455e-03  2.832e-03   2.279   0.0228 *  
## weight              2.636e-02  1.221e-02   2.159   0.0310 *  
## height             -3.121e-02  1.214e-02  -2.571   0.0102 *  
## bmi                -7.503e-02  3.398e-02  -2.208   0.0274 *  
## triglycerides       2.116e-02  1.620e-03  13.060  < 2e-16 ***
## I(triglycerides^2) -4.954e-05  4.948e-06 -10.011  < 2e-16 ***
## I(age^2)           -1.105e-03  9.325e-05 -11.849  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.53 on 2493 degrees of freedom
## Multiple R-squared:  0.2085, Adjusted R-squared:  0.2057 
## F-statistic: 72.98 on 9 and 2493 DF,  p-value: < 2.2e-16

After model selection, the terms age, systolic, diastolic, weight, height, bmi, triglycerides, \(triglycerides^2\) and \(age^2\) are retained, and all are significant under the t-test. Height, bmi and the two squared terms are negatively associated with \(\sqrt{ldl}\), while the other variables are positively associated with it. The residual standard error falls to 1.53, and the \(R^2\) increases to 0.2085, which means this model performs better than the one without squared terms.

Stata

Data Cleaning

* import demographic data
import sasxport "./Original Data/DEMO_I.XPT.txt", clear

* rename variables
rename riagendr gender
rename ridageyr age
rename ridreth3 race

* select variables of focus
keep seqn gender age race

* save the cleaned demographics data
save "./Xiru Lyu/Data/demo.dta", replace

* import diet (day 1) data
import sasxport "./Original Data/DR1TOT_I.XPT.txt", clear

* rename variables
rename dr1ttfat fat1
rename dr1tchol chol1

* select variables of interest
keep seqn fat1 chol1

* save the cleaned diet (day 1) dataset
save "./Xiru Lyu/Data/diet1.dta", replace

* import diet (day 2) data
import sasxport "./Original Data/DR2TOT_I.XPT.txt", clear

* rename variables
rename dr2ttfat fat2
rename dr2tchol chol2

* select variables of focus
keep seqn fat2 chol2

* save the cleaned diet (day 2) dataset
save "./Xiru Lyu/Data/diet2.dta", replace

* import LDL & triglyceride data
import sasxport "./Original Data/TRIGLY_I.XPT.txt", clear

* rename variables
rename lbdldl ldl
rename lbxtr triglyceride

* select variables of interest
keep seqn ldl triglyceride

* save the cleaned cholesterol dataset
save "./Xiru Lyu/Data/ldl.dta", replace

* import blood pressure data
import sasxport "./Original Data/BPX_I.XPT.txt", clear

* rename variables
rename bpxsy1 sy1
rename bpxsy2 sy2
rename bpxsy3 sy3
rename bpxsy4 sy4
rename bpxdi1 di1
rename bpxdi2 di2
rename bpxdi3 di3
rename bpxdi4 di4

* compute averaged systolic and diastolic blood pressure for each participant
egen systolic = rowmean(sy1 sy2 sy3 sy4)
egen diastolic = rowmean(di1 di2 di3 di4)

* select variables of interest
keep seqn systolic diastolic

* save the cleaned blood pressure dataset
save "./Xiru Lyu/Data/bp.dta", replace

* import body measure data
import sasxport "./Original Data/BMX_I.XPT.txt", clear

* rename variables
rename bmxwt weight
rename bmxbmi bmi
rename bmxht height

* select variables of interest
keep seqn weight height bmi

* merge datasets by seqn
merge 1:1 seqn using "./Xiru Lyu/Data/demo.dta"
keep if _merge == 3
drop _merge
merge 1:1 seqn using "./Xiru Lyu/Data/diet1.dta"
keep if _merge == 3
drop _merge
merge 1:1 seqn using "./Xiru Lyu/Data/diet2.dta"
keep if _merge == 3
drop _merge
merge 1:1 seqn using "./Xiru Lyu/Data/bp.dta"
keep if _merge == 3
drop _merge
merge 1:1 seqn using "./Xiru Lyu/Data/ldl.dta"
keep if _merge == 3
drop _merge

* compute averaged intakes of fat and cholesterol
egen fat = rowmean(fat1 fat2)
egen chol = rowmean(chol1 chol2)

* drop extra columns
drop fat1 fat2 chol1 chol2

* drop rows with missing values
foreach var of varlist age bmi chol diastolic fat gender height ldl race ///
seqn systolic triglyceride weight{
drop if missing(`var')
}

* save the dataset for data analysis
save "./Xiru Lyu/Data/final.dta", replace

Models

* transform the dependent variable
generate ldl2 = sqrt(ldl)

* fit a multiple linear regression model
regress ldl2 age i.race i.gender bmi weight height diastolic systolic chol ///
fat trig
Regression Result of Full Model (2)

I first fitted the full model, including all possible covariates and the transformed dependent variable, and then applied forward and backward stepwise selection. To be consistent with the result produced by R, I used the backward-selection result for the later AIC/BIC comparison.

* backward selection
xi: stepwise, pr(.1): regress ldl2 age i.race i.gender bmi weight height ///
diastolic systolic chol fat triglyceride
Backward Selection Result of Full Model (2)

* forward selection
xi: stepwise, pe(.1): regress ldl2 age i.race i.gender bmi weight height ///
diastolic systolic chol fat triglyceride
Forward Selection Result of Full Model (2)

* transform covariates
generate triglyceride2 = triglyceride^2
generate age2 = age^2

* fit a multiple linear regression model
regress ldl2 age age2 i.race i.gender bmi weight height diastolic systolic ///
chol fat triglyceride triglyceride2

I then fitted another model containing the two extra transformed independent variables. After the stepwise selection procedure, I kept the model chosen by backward selection for the further AIC/BIC comparison, so that my result is consistent with the one produced by R.

Regression Result of Full Model (3)

* backward selection
xi: stepwise, pr(.05): regress ldl2 age age2 i.race i.gender bmi weight  ///
height diastolic systolic chol fat triglyceride triglyceride2
Backward Selection Result of Full Model (3)

* forward selection
xi: stepwise, pe(.05): regress ldl2 age age2 i.race i.gender bmi weight ///
height diastolic systolic chol fat triglyceride triglyceride2
Forward Selection Result of Full Model (3)

* compare AIC & BIC of two nested models
regress ldl2 age height weight diastolic triglyceride
estat ic
AIC for Model 2_a

regress ldl2 age age2 height weight bmi diastolic systolic triglyceride ///
triglyceride2
estat ic
AIC for Model 2_b

A comparison of AIC/BIC for the two nested models shows that the model \(\mathbf{\sqrt{LDL}}\) ~ \(\mathbf{age + age^2 + height + weight + BMI + triglyceride + triglyceride^2 + diastolic + systolic}\) is the better one. Inferential statistics for the model are produced below. The covariates age, triglyceride, weight, and diastolic and systolic blood pressure are positively associated with \(\sqrt{LDL}\), while \(age^2\), \(triglyceride^2\) and BMI are negatively associated with the response variable.

Specifically, with the other variables fixed, a one-year increase in age leads to approximately 0.10 units of increase in \(\sqrt{ldl}\), and the rate of increase slows down as one ages. A one-unit increase in diastolic blood pressure raises \(\sqrt{ldl}\) by 0.006 units, and a one-unit increase in systolic blood pressure raises it by 0.004 units. A one-unit increase in BMI lowers \(\sqrt{ldl}\) by 0.075 units. One-unit increases in weight and triglyceride raise \(\sqrt{ldl}\) by 0.026 and 0.021 units, respectively. Finally, a one-unit increase in height leads to approximately 0.031 units of decrease in \(\sqrt{ldl}\).
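The AIC/BIC comparison performed by estat ic follows the Gaussian log-likelihood convention, which can be sketched in Python on invented data (a toy illustration, not the NHANES model):

```python
import numpy as np

def ols_ic(y, X):
    """AIC and BIC of an OLS fit from its Gaussian log-likelihood
    (the same convention, up to a constant, as estat ic / R's AIC())."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    llf = -n / 2 * (np.log(2 * np.pi * rss / n) + 1)
    p = k + 1                      # coefficients plus the error variance
    return -2 * llf + 2 * p, -2 * llf + np.log(n) * p

# Toy comparison: a model with a needed squared term vs. one without
rng = np.random.default_rng(3)
n = 400
x = rng.uniform(0, 10, n)
y = 1 + 0.5 * x - 0.05 * x**2 + rng.normal(scale=0.5, size=n)
X_lin = np.column_stack([np.ones(n), x])
X_quad = np.column_stack([np.ones(n), x, x**2])
aic_lin, bic_lin = ols_ic(y, X_lin)
aic_quad, bic_quad = ols_ic(y, X_quad)
print(aic_quad < aic_lin, bic_quad < bic_lin)
```

Lower is better for both criteria; here the model with the squared term wins on each, mirroring the comparison between Models 2_a and 2_b above.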

Python

Data Cleaning

import pandas as pd
import numpy as np
LDLdata=pd.read_sas(r'./Original Data/TRIGLY_I.XPT',encoding='utf8')
DR1=pd.read_sas(r'./Original Data/DR1TOT_I.XPT',encoding='utf8')
DR2=pd.read_sas(r'./Original Data/DR2TOT_I.XPT',encoding='utf8')
BPX=pd.read_sas(r'./Original Data/BPX_I.XPT',encoding='utf8')
DEMO=pd.read_sas(r'./Original Data/DEMO_I.XPT',encoding='utf8')
BMI=pd.read_sas(r'./Original Data/BMX_I.XPT',encoding='utf8')

LDL=LDLdata[['SEQN','LBXTR', 'LBDLDL']]
#LBXTR: triglycerides (mg/dL)
#LBDLDL: LDL (mg/dL)

BloodP=BPX[['SEQN']].copy()  # .copy() avoids pandas' SettingWithCopyWarning
BloodP['BPXSY']=np.nanmean(BPX[['BPXSY1','BPXSY2','BPXSY3','BPXSY4']],axis=1)
BloodP['BPXDI']=np.nanmean(BPX[['BPXDI1','BPXDI2','BPXDI3','BPXDI4']],axis=1)
BloodP
##          SEQN       BPXSY       BPXDI
## 0     83732.0  122.666667   65.333333
## 1     83733.0  140.000000   86.000000
## 2     83734.0  135.333333   45.333333
## 3     83735.0  134.000000   70.000000
## 4     83736.0  104.000000   60.000000
## 5     83737.0  119.333333   58.666667
## 6     83738.0  100.000000   48.000000
## 7     83739.0         NaN         NaN
## 8     83740.0         NaN         NaN
## 9     83741.0  111.333333   72.666667
## 10    83742.0  118.000000   70.666667
## 11    83743.0         NaN         NaN
## 12    83744.0  179.333333  111.333333
## 13    83745.0  111.333333   78.000000
## 14    83746.0         NaN         NaN
## 15    83747.0  148.000000   92.000000
## 16    83748.0         NaN         NaN
## 17    83749.0  113.333333   58.666667
## 18    83750.0  110.666667   72.000000
## 19    83751.0  103.333333   53.333333
## 20    83752.0  104.000000   50.000000
## 21    83753.0  121.333333   61.333333
## 22    83754.0  118.666667   72.000000
## 23    83755.0  133.333333   81.333333
## 24    83756.0  120.000000   64.000000
## 25    83757.0  142.666667   62.666667
## 26    83759.0  104.000000   68.666667
## 27    83760.0         NaN         NaN
## 28    83761.0  107.333333   61.333333
## 29    83762.0  140.000000   76.666667
## ...       ...         ...         ...
## 9514  93672.0         NaN         NaN
## 9515  93673.0         NaN         NaN
## 9516  93674.0   94.666667   56.000000
## 9517  93675.0  112.666667   59.333333
## 9518  93676.0  115.333333   76.666667
## 9519  93677.0  113.333333   75.333333
## 9520  93678.0         NaN         NaN
## 9521  93679.0  136.000000   68.666667
## 9522  93680.0  112.000000   58.666667
## 9523  93682.0  126.000000   81.333333
## 9524  93683.0         NaN         NaN
## 9525  93684.0  110.666667   71.333333
## 9526  93685.0  127.333333   55.333333
## 9527  93686.0  112.000000   70.000000
## 9528  93687.0  114.666667   61.333333
## 9529  93688.0         NaN         NaN
## 9530  93689.0  164.000000   66.000000
## 9531  93690.0  115.333333   62.000000
## 9532  93691.0  112.000000   76.000000
## 9533  93692.0         NaN         NaN
## 9534  93693.0         NaN         NaN
## 9535  93694.0         NaN         NaN
## 9536  93695.0  111.333333   47.333333
## 9537  93696.0  116.000000   72.000000
## 9538  93697.0  148.000000   55.333333
## 9539  93698.0         NaN         NaN
## 9540  93699.0         NaN         NaN
## 9541  93700.0  104.666667   65.333333
## 9542  93701.0  114.000000   48.666667
## 9543  93702.0  118.666667   66.000000
## 
## [9544 rows x 3 columns]
drday1=DR1[['SEQN','DR1TTFAT','DR1TCHOL']]
drday2=DR2[['SEQN','DR2TTFAT','DR2TCHOL']]
drboth=pd.merge(drday1,drday2,how='inner',on='SEQN')
drboth['FAT']=np.nanmean(drboth[['DR1TTFAT','DR2TTFAT']],axis=1)
drboth['CHOL']=np.nanmean(drboth[['DR1TCHOL','DR2TCHOL']],axis=1)
dr=drboth[['SEQN','FAT','CHOL']]
demo=DEMO[['SEQN','RIAGENDR','RIDAGEYR','RIDRETH3']].copy()
#ID, GENDER, AGE, RACE (factor); .copy() avoids SettingWithCopyWarning
demo.rename(columns={'RIAGENDR':'GENDER','RIDAGEYR':'AGE','RIDRETH3':'RACE'},inplace=True)
BMI=BMI[['SEQN','BMXHT','BMXWT','BMXBMI']]
#ID,HEIGHT,WEIGHT
BMI.columns=['SEQN','HEIGHT','WEIGHT','BMI']
Data=pd.merge(LDL,BloodP,how='inner',on='SEQN')
Data=pd.merge(Data,dr,how='inner',on='SEQN')
Data=pd.merge(Data,demo,how='inner',on='SEQN')
Data=pd.merge(Data,BMI,how='inner',on='SEQN')

Data=Data.dropna(axis=0)
Data.columns
## Index(['SEQN', 'LBXTR', 'LBDLDL', 'BPXSY', 'BPXDI', 'FAT', 'CHOL', 'GENDER',
##        'AGE', 'RACE', 'HEIGHT', 'WEIGHT', 'BMI'],
##       dtype='object')
Data.to_csv(r'./kaliu python/Data1.csv',index=None)

Modeling

import pandas as pd
import numpy as np
from sklearn import linear_model
from scipy import stats
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import pylab
import statsmodels.api as sm
## Stepwise selection helpers adapted from:
##   https://github.com/talhahascelik/python_stepwiseSelection/blob/master/stepwiseSelection.py

#Copyright 2019 Sinan Talha Hascelik
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.


def forwardSelection(X, y, model_type ="linear",elimination_criteria = "aic", varchar_process = "dummy_dropfirst", sl=0.05):
    """
    Forward Selection is a function, based on regression models, that returns significant features and selection iterations.\n
    Required Libraries: pandas, numpy, statsmodels
    
    Parameters
    ----------
    X : Independent variables (Pandas Dataframe)\n
    y : Dependent variable (Pandas Series, Pandas Dataframe)\n
    model_type : 'linear' or 'logistic'\n
    elimination_criteria : 'aic', 'bic', 'r2', 'adjr2' or None\n
        'aic' refers to the Akaike information criterion\n
        'bic' refers to the Bayesian information criterion\n
        'r2' refers to R-squared (only works with the linear model type)\n
        'adjr2' refers to adjusted R-squared (only works with the linear model type)\n
    varchar_process : 'drop', 'dummy' or 'dummy_dropfirst'\n
        'drop' drops varchar features\n
        'dummy' creates dummies for all levels of all varchars\n
        'dummy_dropfirst' creates dummies for all levels of all varchars, and drops first levels\n
    sl : Significance Level (default: 0.05)\n
    
    Returns
    -------
    columns(list), iteration_logs(str)\n\n
    Not Returns a Model
    
    Tested On
    ---------
    Python v3.6.7, Pandas v0.23.4, Numpy v1.15.04, StatModels v0.9.0
    
    See Also
    --------
    https://en.wikipedia.org/wiki/Stepwise_regression
    """
    X = __varcharProcessing__(X,varchar_process = varchar_process)
    return __forwardSelectionRaw__(X, y, model_type = model_type,elimination_criteria = elimination_criteria , sl=sl)
    
def backwardSelection(X, y, model_type="linear", elimination_criteria="aic", varchar_process="dummy_dropfirst", sl=0.05):
    """
    Backward Selection is a function, based on regression models, that returns the significant features and the selection iteration logs.\n
    Required Libraries: pandas, numpy, statsmodels
    
    Parameters
    ----------
    X : Independent variables (Pandas Dataframe)\n
    y : Dependent variable (Pandas Series, Pandas Dataframe)\n
    model_type : 'linear' or 'logistic'\n
    elimination_criteria : 'aic', 'bic', 'r2', 'adjr2' or None\n
        'aic' refers to the Akaike information criterion\n
        'bic' refers to the Bayesian information criterion\n
        'r2' refers to R-squared (only works with the linear model type)\n
        'adjr2' refers to adjusted R-squared (only works with the linear model type)\n
    varchar_process : 'drop', 'dummy' or 'dummy_dropfirst'\n
        'drop' drops varchar features\n
        'dummy' creates dummies for all levels of all varchars\n
        'dummy_dropfirst' creates dummies for all levels of all varchars, and drops first levels\n
    sl : Significance Level (default: 0.05)\n
    
    Returns
    -------
    columns(list), iteration_logs(str)\n\n
    Does not return a fitted model
    
    Tested On
    ---------
    Python v3.6.7, Pandas v0.23.4, Numpy v1.15.04, statsmodels v0.9.0
    
    See Also
    --------
    https://en.wikipedia.org/wiki/Stepwise_regression    
    """
    X = __varcharProcessing__(X,varchar_process = varchar_process)
    return __backwardSelectionRaw__(X, y, model_type = model_type,elimination_criteria = elimination_criteria , sl=sl)

def __varcharProcessing__(X, varchar_process = "dummy_dropfirst"):
    
    dtypes = X.dtypes
    obj_cols = dtypes[dtypes == object].index.tolist()  # np.object is removed in modern NumPy; use the builtin
    if varchar_process == "drop":
        X = X.drop(columns=obj_cols)
        print("Character Variables (Dropped):", obj_cols)
    elif varchar_process == "dummy":
        X = pd.get_dummies(X, drop_first=False)
        print("Character Variables (Dummies Generated):", obj_cols)
    else:  # "dummy_dropfirst" (default) and any unrecognized value
        X = pd.get_dummies(X, drop_first=True)
        print("Character Variables (Dummies Generated, First Dummies Dropped):", obj_cols)
    
    X["intercept"] = 1
    cols = X.columns.tolist()
    cols = cols[-1:] + cols[:-1]
    X = X[cols]
    
    return X

def __forwardSelectionRaw__(X, y, model_type ="linear",elimination_criteria = "aic", sl=0.05):

    iterations_log = ""
    cols = X.columns.tolist()
    
    def regressor(y,X, model_type=model_type):
        if model_type == "linear":
            regressor = sm.OLS(y, X).fit()
        elif model_type == "logistic":
            regressor = sm.Logit(y, X).fit()
        else:
            print("\nWrong Model Type : " + model_type + "\nLinear model type is selected.")
            model_type = "linear"
            regressor = sm.OLS(y, X).fit()
        return regressor
    
    selected_cols = ["intercept"]
    other_cols = cols.copy()
    other_cols.remove("intercept")
    
    model = regressor(y, X[selected_cols])
    
    if elimination_criteria == "aic":
        criteria = model.aic
    elif elimination_criteria == "bic":
        criteria = model.bic
    elif elimination_criteria == "r2" and model_type =="linear":
        criteria = model.rsquared
    elif elimination_criteria == "adjr2" and model_type =="linear":
        criteria = model.rsquared_adj
    
    
    for i in range(X.shape[1]):
        pvals = pd.DataFrame(columns=["Cols", "Pval"])
        for j in other_cols:
            model = regressor(y, X[selected_cols + [j]])
            # DataFrame.append was removed in pandas 2.0; build rows with pd.concat instead
            pvals = pd.concat([pvals, pd.DataFrame([[j, model.pvalues[j]]], columns=["Cols", "Pval"])], ignore_index=True)
        pvals = pvals.sort_values(by = ["Pval"]).reset_index(drop=True)
        pvals = pvals[pvals.Pval<=sl]
        if pvals.shape[0] > 0:
            
            model = regressor(y, X[selected_cols+[pvals["Cols"][0]]])
            iterations_log += str("\nEntered : "+pvals["Cols"][0] + "\n")    
            iterations_log += "\n\n"+str(model.summary())+"\nAIC: "+ str(model.aic) + "\nBIC: "+ str(model.bic)+"\n\n"
                    
        
            if  elimination_criteria == "aic":
                new_criteria = model.aic
                if new_criteria < criteria:
                    print("Entered :", pvals["Cols"][0], "\tAIC :", model.aic)
                    selected_cols.append(pvals["Cols"][0])
                    other_cols.remove(pvals["Cols"][0])
                    criteria = new_criteria
                else:
                    print("break : Criteria")
                    break
            elif  elimination_criteria == "bic":
                new_criteria = model.bic
                if new_criteria < criteria:
                    print("Entered :", pvals["Cols"][0], "\tBIC :", model.bic)
                    selected_cols.append(pvals["Cols"][0])
                    other_cols.remove(pvals["Cols"][0])
                    criteria = new_criteria
                else:
                    print("break : Criteria")
                    break        
            elif  elimination_criteria == "r2" and model_type =="linear":
                new_criteria = model.rsquared
                if new_criteria > criteria:
                    print("Entered :", pvals["Cols"][0], "\tR2 :", model.rsquared)
                    selected_cols.append(pvals["Cols"][0])
                    other_cols.remove(pvals["Cols"][0])
                    criteria = new_criteria
                else:
                    print("break : Criteria")
                    break           
            elif  elimination_criteria == "adjr2" and model_type =="linear":
                new_criteria = model.rsquared_adj
                if new_criteria > criteria:
                    print("Entered :", pvals["Cols"][0], "\tAdjR2 :", model.rsquared_adj)
                    selected_cols.append(pvals["Cols"][0])
                    other_cols.remove(pvals["Cols"][0])
                    criteria = new_criteria
                else:
                    print("Break : Criteria")
                    break
            else:
                print("Entered :", pvals["Cols"][0])
                selected_cols.append(pvals["Cols"][0])
                other_cols.remove(pvals["Cols"][0])            
                
        else:
            print("Break : Significance Level")
            break
        
    model = regressor(y, X[selected_cols])
    if elimination_criteria == "aic":
        criteria = model.aic
    elif elimination_criteria == "bic":
        criteria = model.bic
    elif elimination_criteria == "r2" and model_type =="linear":
        criteria = model.rsquared
    elif elimination_criteria == "adjr2" and model_type =="linear":
        criteria = model.rsquared_adj
    
    print(model.summary())
    print("AIC: "+str(model.aic))
    print("BIC: "+str(model.bic))
    print("Final Variables:", selected_cols)

    return selected_cols, iterations_log

def __backwardSelectionRaw__(X, y, model_type ="linear",elimination_criteria = "aic", sl=0.05):
    
    iterations_log = ""
    last_eleminated = ""    
    cols = X.columns.tolist()

    def regressor(y,X, model_type=model_type):
        if model_type =="linear":
            regressor = sm.OLS(y, X).fit()
        elif model_type == "logistic":
            regressor = sm.Logit(y, X).fit()
        else:
            print("\nWrong Model Type : " + model_type + "\nLinear model type is selected.")
            model_type = "linear"
            regressor = sm.OLS(y, X).fit()
        return regressor
    for i in range(X.shape[1]):
        if i != 0 :          
            if elimination_criteria == "aic":
                criteria = model.aic
                new_model = regressor(y,X)
                new_criteria = new_model.aic
                if criteria < new_criteria:
                    print("Regained : ", last_eleminated)
                    iterations_log += "\n"+str(new_model.summary())+"\nAIC: "+ str(new_model.aic) + "\nBIC: "+ str(new_model.bic)+"\n"
                    iterations_log += str("\n\nRegained : "+last_eleminated + "\n\n")
                    break  
            elif elimination_criteria == "bic":
                criteria = model.bic
                new_model = regressor(y,X)
                new_criteria = new_model.bic
                if criteria < new_criteria:
                    print("Regained : ", last_eleminated)
                    iterations_log += "\n"+str(new_model.summary())+"\nAIC: "+ str(new_model.aic) + "\nBIC: "+ str(new_model.bic)+"\n"
                    iterations_log += str("\n\nRegained : "+last_eleminated + "\n\n")
                    break  
            elif elimination_criteria == "adjr2" and model_type =="linear":
                criteria = model.rsquared_adj
                new_model = regressor(y,X)
                new_criteria = new_model.rsquared_adj
                if criteria > new_criteria:
                    print("Regained : ", last_eleminated)
                    iterations_log += "\n"+str(new_model.summary())+"\nAIC: "+ str(new_model.aic) + "\nBIC: "+ str(new_model.bic)+"\n"
                    iterations_log += str("\n\nRegained : "+last_eleminated + "\n\n")
                    break  
            elif elimination_criteria == "r2" and model_type =="linear":
                criteria = model.rsquared
                new_model = regressor(y,X)
                new_criteria = new_model.rsquared
                if criteria > new_criteria:
                    print("Regained : ", last_eleminated)
                    iterations_log += "\n"+str(new_model.summary())+"\nAIC: "+ str(new_model.aic) + "\nBIC: "+ str(new_model.bic)+"\n"
                    iterations_log += str("\n\nRegained : "+last_eleminated + "\n\n")
                    break   
            else: 
                new_model = regressor(y,X)
            model = new_model
            iterations_log += "\n"+str(model.summary())+"\nAIC: "+ str(model.aic) + "\nBIC: "+ str(model.bic)+"\n"
        else:
            model = regressor(y,X)
            iterations_log += "\n"+str(model.summary())+"\nAIC: "+ str(model.aic) + "\nBIC: "+ str(model.bic)+"\n"
        maxPval = max(model.pvalues)
        cols = X.columns.tolist()
        if maxPval > sl:
            for j in cols:
                if (model.pvalues[j] == maxPval):
                    print("Eliminated :" ,j)
                    iterations_log += str("\n\nEliminated : "+j+ "\n\n")
                    
                    del X[j]
                    last_eleminated = j
        else:
            break
    print(str(model.summary())+"\nAIC: "+ str(model.aic) + "\nBIC: "+ str(model.bic))
    print("Final Variables:", cols)
    iterations_log += "\n"+str(model.summary())+"\nAIC: "+ str(model.aic) + "\nBIC: "+ str(model.bic)+"\n"
    return cols, iterations_log
    
Data=pd.read_csv(r'./kaliu python/Data1.csv')
Data=Data.drop(['SEQN'],axis=1)#drop the id variable 'SEQN' 

RACE = pd.get_dummies(Data['RACE'])
RACE.columns = ['Mexican American', 'Other Hispanic', 'Non-Hispanic White',
                'Non-Hispanic Black', 'Non-Hispanic Asian', 'Other Race']
RACE = RACE.drop(['Mexican American'], axis=1)  # 'Mexican American' is the base level

GENDER = pd.get_dummies(Data['GENDER'])
GENDER.columns = ['Male', 'Female']
GENDER = GENDER.drop(['Male'], axis=1)  # 'Male' is the base level

Data = Data.drop(['GENDER', 'RACE'], axis=1)
Data = pd.concat([Data, GENDER, RACE], axis=1)
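The manual base-level drops above can also be done in one step with `drop_first=True`, which removes the first category in sorted order; a small sketch on made-up values:

```python
import pandas as pd

# pd.get_dummies(..., drop_first=True) drops the first category in sorted
# order, which here is 'Mexican American' -- the same base level chosen above.
race = pd.Series(['Mexican American', 'Other Hispanic', 'Non-Hispanic White'])
dummies = pd.get_dummies(race, drop_first=True)
print(list(dummies.columns))  # → ['Non-Hispanic White', 'Other Hispanic']
```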

y1=Data[['LBDLDL']]        ## linear regression; the response is LBDLDL (LDL)
x1=Data.drop(['LBDLDL'],axis=1)  ## all remaining columns are predictors
x1=sm.add_constant(x1)     ## statsmodels requires the intercept column to be added by hand
lm1=sm.OLS(y1.astype(float),x1.astype(float)).fit() # OLS y on X+1

lm1.summary()     # coefficient table
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                 LBDLDL   R-squared:                       0.134
## Model:                            OLS   Adj. R-squared:                  0.129
## Method:                 Least Squares   F-statistic:                     25.69
## Date:                Wed, 11 Dec 2019   Prob (F-statistic):           2.47e-67
## Time:                        23:24:47   Log-Likelihood:                -12321.
## No. Observations:                2503   AIC:                         2.467e+04
## Df Residuals:                    2487   BIC:                         2.477e+04
## Df Model:                          15                                         
## Covariance Type:            nonrobust                                         
## ======================================================================================
##                          coef    std err          t      P>|t|      [0.025      0.975]
## --------------------------------------------------------------------------------------
## const                101.5715     44.874      2.263      0.024      13.576     189.566
## LBXTR                  0.1428      0.011     12.546      0.000       0.120       0.165
## BPXSY                 -0.0059      0.046     -0.129      0.898      -0.096       0.084
## BPXDI                  0.4018      0.057      6.994      0.000       0.289       0.514
## FAT                    0.0053      0.023      0.233      0.816      -0.039       0.050
## CHOL                  -0.0013      0.004     -0.288      0.774      -0.010       0.007
## AGE                    0.1758      0.039      4.544      0.000       0.100       0.252
## HEIGHT                -0.3181      0.270     -1.180      0.238      -0.847       0.211
## WEIGHT                 0.3017      0.266      1.133      0.257      -0.220       0.824
## BMI                   -0.5955      0.741     -0.803      0.422      -2.049       0.858
## Female                 1.1712      1.848      0.634      0.526      -2.452       4.794
## Other Hispanic         3.9526      2.414      1.637      0.102      -0.781       8.687
## Non-Hispanic White     2.0370      2.094      0.973      0.331      -2.069       6.143
## Non-Hispanic Black     3.5060      2.286      1.534      0.125      -0.977       7.989
## Non-Hispanic Asian     5.1046      2.738      1.864      0.062      -0.265      10.474
## Other Race             6.0369      3.880      1.556      0.120      -1.572      13.646
## ==============================================================================
## Omnibus:                      121.293   Durbin-Watson:                   1.991
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):              161.325
## Skew:                           0.469   Prob(JB):                     9.30e-36
## Kurtosis:                       3.816   Cond. No.                     2.87e+04
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 2.87e+04. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """

From these results, many coefficients are not significant. To select variables more carefully, we first examine the residuals and then use a Box-Cox transformation.

res1=lm1.resid
y1_fit=lm1.predict(x1)
fig1=plt.figure(figsize=(8,6))
plt.plot(y1_fit,res1,'oc')  # 'o' draws dots with no connecting line; 'c' colors them cyan
plt.title('Residuals against fitted values')
plt.ylabel('Residuals')
plt.xlabel('Fitted value')
plt.show()

res = lm1.resid  # residuals
stats.probplot(res, dist="norm", plot=pylab) # QQ plot, similar to fig = sm.qqplot(res)
## ((array([-3.45329821, -3.20637418, -3.06964752, ...,  3.06964752,
##         3.20637418,  3.45329821]), array([-113.64623725, -108.54537041,  -89.63120833, ...,  136.31796236,
##         145.92859063,  149.84990144])), (33.06076150557924, 1.636389036476502e-12, 0.9934850678740555))
pylab.show()
(Figure: normal probability plot of the residuals)

We first plot the residuals against the fitted values, then draw a QQ plot of the residuals. The QQ plot shows that the residuals are not really normal.
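The tuple printed by `stats.probplot` above ends with the correlation coefficient of the probability plot (about 0.993 here); values noticeably below 1 signal departure from normality. A sketch on synthetic heavy-tailed data:

```python
import numpy as np
from scipy import stats

# probplot returns ((osm, osr), (slope, intercept, r)); r is the correlation
# of the QQ points with the fitted line -- near 1 means close to normal.
rng = np.random.default_rng(1)
res = rng.standard_t(df=3, size=2000)  # heavy tails, mimicking non-normal residuals
(_, _), (slope, intercept, r) = stats.probplot(res)
print(round(r, 3))
```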

#boxcox
y2=np.array(y1).flatten()
fig2=plt.figure()
ax = fig2.add_subplot(111)
prob=stats.boxcox_normplot(y2,0,1,plot=ax)
_, maxlog=stats.boxcox(y2)
ax.axvline(maxlog,color='r')
## <matplotlib.lines.Line2D object at 0x13bf661d0>
plt.show()

From the plot, we can transform the response to \(y^\lambda\). The result looks best when \(\lambda\) is around 0.4; to keep things simple, we pick \(\lambda=0.5\).
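As a quick check (made-up values), the simplified \(y^{0.5}\) transform agrees with the exact Box-Cox transform at \(\lambda = 0.5\) up to an affine rescaling, which does not change an OLS fit:

```python
import numpy as np
from scipy import stats

# With a fixed lambda, stats.boxcox returns (y**lambda - 1) / lambda, an
# affine rescaling of y**0.5 -- so using y**0.5 directly is equivalent for OLS.
y = np.array([64.0, 100.0, 144.0])  # made-up LDL-like values
bc = stats.boxcox(y, lmbda=0.5)
print(bc.tolist())  # → [14.0, 18.0, 22.0], i.e. 2*(sqrt(y) - 1)
```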

Data['LDL']=(Data[['LBDLDL']])**0.5 # transform the y value to y**0.5
Data=Data.drop(['LBDLDL'],axis=1)

y3=Data[['LDL']]
x3=Data.drop(['LDL'],axis=1)
final_vars,_=forwardSelection(x3,y3,model_type='linear',elimination_criteria='aic')
## Character Variables (Dummies Generated, First Dummies Dropped): []
## Entered : LBXTR  AIC : 9574.218891776083
## Entered : BPXDI  AIC : 9503.117731043025
## Entered : AGE    AIC : 9475.727581279705
## Entered : BMI    AIC : 9470.123755541823
## Break : Significance Level
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                    LDL   R-squared:                       0.130
## Model:                            OLS   Adj. R-squared:                  0.128
## Method:                 Least Squares   F-statistic:                     93.04
## Date:                Wed, 11 Dec 2019   Prob (F-statistic):           7.62e-74
## Time:                        23:24:49   Log-Likelihood:                -4730.1
## No. Observations:                2503   AIC:                             9470.
## Df Residuals:                    2498   BIC:                             9499.
## Df Model:                           4                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## intercept      7.4982      0.199     37.641      0.000       7.108       7.889
## LBXTR          0.0065      0.001     12.356      0.000       0.005       0.008
## BPXDI          0.0199      0.003      7.769      0.000       0.015       0.025
## AGE            0.0081      0.002      5.007      0.000       0.005       0.011
## BMI            0.0130      0.005      2.757      0.006       0.004       0.022
## ==============================================================================
## Omnibus:                        8.993   Durbin-Watson:                   1.996
## Prob(Omnibus):                  0.011   Jarque-Bera (JB):               11.361
## Skew:                          -0.017   Prob(JB):                      0.00341
## Kurtosis:                       3.328   Cond. No.                         895.
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## AIC: 9470.123755541823
## BIC: 9499.249981998983
## Final Variables: ['intercept', 'LBXTR', 'BPXDI', 'AGE', 'BMI']
final_vars,_=backwardSelection(x3,y3,model_type='linear',elimination_criteria='aic')
## Character Variables (Dummies Generated, First Dummies Dropped): []
## Eliminated : BPXSY
## Eliminated : FAT
## Eliminated : CHOL
## Eliminated : Female
## Eliminated : BMI
## Eliminated : Non-Hispanic White
## Eliminated : Non-Hispanic Black
## Eliminated : Other Hispanic
## Eliminated : Other Race
## Eliminated : Non-Hispanic Asian
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                    LDL   R-squared:                       0.130
## Model:                            OLS   Adj. R-squared:                  0.129
## Method:                 Least Squares   F-statistic:                     74.87
## Date:                Wed, 11 Dec 2019   Prob (F-statistic):           2.88e-73
## Time:                        23:24:49   Log-Likelihood:                -4729.0
## No. Observations:                2503   AIC:                             9470.
## Df Residuals:                    2497   BIC:                             9505.
## Df Model:                           5                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## intercept      8.8489      0.577     15.332      0.000       7.717       9.981
## LBXTR          0.0065      0.001     12.306      0.000       0.005       0.008
## BPXDI          0.0203      0.003      7.847      0.000       0.015       0.025
## AGE            0.0080      0.002      4.978      0.000       0.005       0.011
## HEIGHT        -0.0084      0.004     -2.301      0.021      -0.016      -0.001
## WEIGHT         0.0049      0.002      2.886      0.004       0.002       0.008
## ==============================================================================
## Omnibus:                        8.921   Durbin-Watson:                   1.994
## Prob(Omnibus):                  0.012   Jarque-Bera (JB):               11.283
## Skew:                          -0.012   Prob(JB):                      0.00355
## Kurtosis:                       3.328   Cond. No.                     4.12e+03
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 4.12e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
## AIC: 9470.09711553212
## BIC: 9505.048587280711
## Final Variables: ['intercept', 'LBXTR', 'BPXDI', 'AGE', 'HEIGHT', 'WEIGHT']

The plots above show a slight inverse-U shape in the LBXTR and AGE relationships, so in the final model we add squared terms for LBXTR and AGE.

Data['AGE2']=Data['AGE']**2     # quadratic term for AGE
Data['LBXTR2']=Data['LBXTR']**2 # quadratic term for triglycerides
y5=Data[['LDL']]
x5=Data.drop(['LDL'],axis=1)

final_vars,_=forwardSelection(x5,y5,model_type='linear',elimination_criteria='aic')
## Character Variables (Dummies Generated, First Dummies Dropped): []
## Entered : LBXTR  AIC : 9574.218891776083
## Entered : LBXTR2     AIC : 9448.966446031709
## Entered : BPXDI  AIC : 9384.22136728393
## Entered : AGE    AIC : 9373.946948986257
## Entered : AGE2   AIC : 9246.164944091768
## Break : Significance Level
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                    LDL   R-squared:                       0.205
## Model:                            OLS   Adj. R-squared:                  0.203
## Method:                 Least Squares   F-statistic:                     128.6
## Date:                Wed, 11 Dec 2019   Prob (F-statistic):          1.75e-121
## Time:                        23:24:51   Log-Likelihood:                -4617.1
## No. Observations:                2503   AIC:                             9246.
## Df Residuals:                    2497   BIC:                             9281.
## Df Model:                           5                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## intercept      6.2939      0.195     32.322      0.000       5.912       6.676
## LBXTR          0.0213      0.002     13.348      0.000       0.018       0.024
## LBXTR2     -4.949e-05    4.9e-06    -10.094      0.000   -5.91e-05   -3.99e-05
## BPXDI          0.0078      0.003      2.971      0.003       0.003       0.013
## AGE            0.1004      0.008     11.994      0.000       0.084       0.117
## AGE2          -0.0010   9.04e-05    -11.528      0.000      -0.001      -0.001
## ==============================================================================
## Omnibus:                        8.099   Durbin-Watson:                   1.962
## Prob(Omnibus):                  0.017   Jarque-Bera (JB):                9.982
## Skew:                          -0.024   Prob(JB):                      0.00680
## Kurtosis:                       3.306   Cond. No.                     1.60e+05
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.6e+05. This might indicate that there are
## strong multicollinearity or other numerical problems.
## AIC: 9246.164944091768
## BIC: 9281.116415840359
## Final Variables: ['intercept', 'LBXTR', 'LBXTR2', 'BPXDI', 'AGE', 'AGE2']
final_vars,_=backwardSelection(x5,y5,model_type='linear',elimination_criteria='aic')
## Character Variables (Dummies Generated, First Dummies Dropped): []
## Eliminated : FAT
## Eliminated : Other Hispanic
## Eliminated : Non-Hispanic Asian
## Eliminated : Female
## Eliminated : CHOL
## Eliminated : Non-Hispanic White
## Regained :  Non-Hispanic White
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                    LDL   R-squared:                       0.210
## Model:                            OLS   Adj. R-squared:                  0.207
## Method:                 Least Squares   F-statistic:                     55.30
## Date:                Wed, 11 Dec 2019   Prob (F-statistic):          1.87e-118
## Time:                        23:24:51   Log-Likelihood:                -4608.2
## No. Observations:                2503   AIC:                             9242.
## Df Residuals:                    2490   BIC:                             9318.
## Df Model:                          12                                         
## Covariance Type:            nonrobust                                         
## ======================================================================================
##                          coef    std err          t      P>|t|      [0.025      0.975]
## --------------------------------------------------------------------------------------
## intercept             11.3355      2.013      5.631      0.000       7.388      15.283
## LBXTR                  0.0216      0.002     13.060      0.000       0.018       0.025
## BPXSY                  0.0040      0.002      1.881      0.060      -0.000       0.008
## BPXDI                  0.0065      0.003      2.279      0.023       0.001       0.012
## AGE                    0.1063      0.009     12.338      0.000       0.089       0.123
## HEIGHT                -0.0333      0.012     -2.731      0.006      -0.057      -0.009
## WEIGHT                 0.0262      0.012      2.143      0.032       0.002       0.050
## BMI                   -0.0757      0.034     -2.228      0.026      -0.142      -0.009
## Non-Hispanic White     0.1253      0.075      1.667      0.096      -0.022       0.273
## Non-Hispanic Black     0.1642      0.088      1.876      0.061      -0.007       0.336
## Other Race             0.2850      0.167      1.702      0.089      -0.043       0.613
## AGE2                  -0.0011   9.45e-05    -11.900      0.000      -0.001      -0.001
## LBXTR2             -5.047e-05      5e-06    -10.099      0.000   -6.03e-05   -4.07e-05
## ==============================================================================
## Omnibus:                        7.701   Durbin-Watson:                   1.963
## Prob(Omnibus):                  0.021   Jarque-Bera (JB):                9.303
## Skew:                          -0.034   Prob(JB):                      0.00955
## Kurtosis:                       3.291   Cond. No.                     1.66e+06
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.66e+06. This might indicate that there are
## strong multicollinearity or other numerical problems.
## AIC: 9242.336445749102
## BIC: 9318.064634537715
## Final Variables: ['intercept', 'LBXTR', 'BPXSY', 'BPXDI', 'AGE', 'HEIGHT', 'WEIGHT', 'BMI', 'Non-Hispanic White', 'Non-Hispanic Black', 'Other Race', 'AGE2', 'LBXTR2']

Based on the AIC comparison, we select the variables ‘LBXTR’, ‘BPXSY’, ‘BPXDI’, ‘AGE’, ‘HEIGHT’, ‘WEIGHT’, ‘BMI’, ‘RACE’, ‘AGE2’ and ‘LBXTR2’.

y=Data[['LDL']]
x=Data[['LBXTR', 'BPXSY', 'BPXDI', 'AGE', 'HEIGHT', 'WEIGHT', 'BMI','Other Hispanic','Non-Hispanic White','Non-Hispanic Black','Non-Hispanic Asian','Other Race', 'AGE2', 'LBXTR2']]
# include all RACE dummies except 'Mexican American', which serves as the base level
x=sm.add_constant(x)
lm_final=sm.OLS(y.astype(float),x.astype(float)).fit()
lm_final.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                    LDL   R-squared:                       0.211
## Model:                            OLS   Adj. R-squared:                  0.206
## Method:                 Least Squares   F-statistic:                     47.49
## Date:                Wed, 11 Dec 2019   Prob (F-statistic):          5.38e-117
## Time:                        23:24:51   Log-Likelihood:                -4607.5
## No. Observations:                2503   AIC:                             9245.
## Df Residuals:                    2488   BIC:                             9332.
## Df Model:                          14                                         
## Covariance Type:            nonrobust                                         
## ======================================================================================
##                          coef    std err          t      P>|t|      [0.025      0.975]
## --------------------------------------------------------------------------------------
## const                 11.1998      2.017      5.552      0.000       7.244      15.155
## LBXTR                  0.0216      0.002     13.012      0.000       0.018       0.025
## BPXSY                  0.0041      0.002      1.922      0.055   -8.27e-05       0.008
## BPXDI                  0.0063      0.003      2.212      0.027       0.001       0.012
## AGE                    0.1057      0.009     12.245      0.000       0.089       0.123
## HEIGHT                -0.0329      0.012     -2.700      0.007      -0.057      -0.009
## WEIGHT                 0.0256      0.012      2.100      0.036       0.002       0.050
## BMI                   -0.0732      0.034     -2.150      0.032      -0.140      -0.006
## Other Hispanic         0.0896      0.111      0.810      0.418      -0.127       0.307
## Non-Hispanic White     0.1879      0.094      1.997      0.046       0.003       0.372
## Non-Hispanic Black     0.2247      0.103      2.175      0.030       0.022       0.427
## Non-Hispanic Asian     0.1355      0.125      1.081      0.280      -0.110       0.381
## Other Race             0.3472      0.177      1.966      0.049       0.001       0.694
## AGE2                  -0.0011   9.46e-05    -11.824      0.000      -0.001      -0.001
## LBXTR2             -5.029e-05      5e-06    -10.053      0.000   -6.01e-05   -4.05e-05
## ==============================================================================
## Omnibus:                        7.784   Durbin-Watson:                   1.965
## Prob(Omnibus):                  0.020   Jarque-Bera (JB):                9.359
## Skew:                          -0.038   Prob(JB):                      0.00928
## Kurtosis:                       3.290   Cond. No.                     1.66e+06
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.66e+06. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
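The race dummies in the table above come from one-hot encoding with the `Mexican American` level dropped as the baseline. A minimal sketch of that encoding step with pandas (the toy data below are illustrative, not the NHANES sample):

```python
import pandas as pd

# toy RACE column; the level labels follow the regression output above
df = pd.DataFrame({"RACE": ["Mexican American", "Other Hispanic",
                            "Non-Hispanic White", "Non-Hispanic Black",
                            "Mexican American"]})

# one dummy per level, then drop the baseline 'Mexican American'
race_dummies = pd.get_dummies(df["RACE"]).drop(columns="Mexican American")
print(sorted(race_dummies.columns))
```

Rows belonging to the baseline level are all zeros across the remaining dummy columns, which is what makes `Mexican American` the reference group in the fitted model.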

Additional Analysis

R

## Additional Analysis: Some extra graphs

## Libraries: -------------------------------------------------------------------------
library(ggplot2)
library(gridExtra)
library(data.table)  # data.table() and DT used below
library(lme4)        # lmer()
library(car)         # crPlots()

## 80: --------------------------------------------------------------------------------

Linear mixed model

## Before graphing, fit a linear mixed model to the dataset:
LM=lmer(ldl~age+intake_fat+intake_chol+systolic+diastolic+
          weight+height+bmi+triglycerides+(1|gender)+(1|race),data=DT)
summary(LM)
## Linear mixed model fit by REML ['lmerMod']
## Formula: ldl ~ age + intake_fat + intake_chol + systolic + diastolic +  
##     weight + height + bmi + triglycerides + (1 | gender) + (1 |      race)
##    Data: DT
## 
## REML criterion at convergence: 24691.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.3525 -0.6773 -0.0739  0.5917  4.4911 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  race     (Intercept)    0.695  0.8337 
##  gender   (Intercept)    0.000  0.0000 
##  Residual             1111.545 33.3398 
## Number of obs: 2503, groups:  race, 6; gender, 2
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)   111.161885  43.863459   2.534
## age             0.175052   0.038036   4.602
## intake_fat      0.001994   0.022497   0.089
## intake_chol    -0.001512   0.004350  -0.348
## systolic       -0.009931   0.045057  -0.220
## diastolic       0.411477   0.056723   7.254
## weight          0.309732   0.265604   1.166
## height         -0.351071   0.264700  -1.326
## bmi            -0.630334   0.737863  -0.854
## triglycerides   0.141182   0.011105  12.714
## 
## Correlation of Fixed Effects:
##             (Intr) age    intk_f intk_c systlc distlc weight height bmi   
## age          0.073                                                        
## intake_fat   0.047  0.053                                                 
## intake_chol -0.011 -0.066 -0.596                                          
## systolic    -0.074 -0.458  0.027  0.015                                   
## diastolic    0.011  0.102 -0.063  0.033 -0.302                            
## weight       0.961  0.071  0.017 -0.012 -0.009  0.022                     
## height      -0.992 -0.054 -0.071  0.009 -0.003 -0.051 -0.963              
## bmi         -0.957 -0.081 -0.016  0.004 -0.003 -0.040 -0.991  0.955       
## triglycerds -0.073 -0.134  0.053 -0.039 -0.047 -0.081 -0.093  0.077  0.067
## convergence code: 0
## boundary (singular) fit: see ?isSingular

According to the fitted model, age, intake_fat, diastolic, weight and triglycerides have positive coefficients, while the remaining covariates have negative ones; the t-values of triglycerides, diastolic and age are large, which shows they are important variables in this model. However, the random-effect variances are 0.695 for race, 0.000 for gender and 1111.545 for the residual; the near-zero group variances (a singular fit) suggest that the linear mixed model may not work very well here.

Graphing

## Graph 1: Some Diagnosis upon these models
### F1: Checking error assumptions -- residual plots of models L2-L5 fitted earlier
R1=data.table(fitted_values=L2$fitted.values,residuals=L2$residuals)
R2=data.table(fitted_values=L3$fitted.values,residuals=L3$residuals)
R3=data.table(fitted_values=L4$fitted.values,residuals=L4$residuals)
R4=data.table(fitted_values=L5$fitted.values,residuals=L5$residuals)
rs1=ggplot(R1,aes(x=fitted_values,y=residuals))+geom_point(size=1,colour='blue')+
  labs(title='Model 1')
rs2=ggplot(R2,aes(x=fitted_values,y=residuals))+geom_point(size=1,colour='blue')+
  labs(title='Model 2')
rs3=ggplot(R3,aes(x=fitted_values,y=residuals))+geom_point(size=1,colour='blue')+
  labs(title='Model 3')
rs4=ggplot(R4,aes(x=fitted_values,y=residuals))+geom_point(size=1,colour='blue')+
  labs(title='Model 4') 
grid.arrange(rs1,rs2,rs3,rs4,nrow=2)

## Graph 2: QQ-plots of the models
par(mfrow=c(2,2))
qqnorm(R1$residuals, ylab="Residuals",main='Q-Q Plot of Model 1')
qqline(R1$residuals)
qqnorm(R2$residuals, ylab="Residuals",main='Q-Q Plot of Model 2')
qqline(R2$residuals)
qqnorm(R3$residuals, ylab="Residuals",main='Q-Q Plot of Model 3')
qqline(R3$residuals)
qqnorm(R4$residuals, ylab="Residuals",main='Q-Q Plot of Model 4')
qqline(R4$residuals)

## Graph 3: Partial Residual Plots upon Model 3 and 4
crPlots(L4,layout=c(3,3))

crPlots(L5)

## Graph 4: Relationships between ldl and gender/race. We compute the mean ldl
## level by gender and by race (for the CIs, we use the jackknife standard
## error).
Mean_JK = function(x){
  ## jackknife standard error of the sample mean
  lx=length(x)
  ## column i of MX holds the sample with observation i left out
  MX=matrix(rep(x,rep(lx-1,lx)),ncol=lx,byrow=TRUE)
  theta=colMeans(MX)            ## leave-one-out means
  mean_theta=mean(theta)
  std_theta={(lx-1)/lx*sum((theta-mean_theta)^2)}^(1/2)
  std_theta
}
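For reference, the same jackknife standard error can be written directly in Python; for the sample mean it reduces to the familiar \(s/\sqrt{n}\). A numpy-only sketch (the function name `jackknife_se_mean` is ours, not from the project code):

```python
import numpy as np

def jackknife_se_mean(x):
    """Jackknife standard error of the sample mean, mirroring Mean_JK above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # leave-one-out means: theta_i = mean of x with observation i removed
    theta = (x.sum() - x) / (n - 1)
    return np.sqrt((n - 1) / n * np.sum((theta - theta.mean()) ** 2))

x = np.array([1.0, 4.0, 2.0, 8.0, 5.0])
# for the mean, the jackknife SE coincides with sd(x)/sqrt(n)
print(jackknife_se_mean(x), np.std(x, ddof=1) / np.sqrt(len(x)))
```

The agreement with \(s/\sqrt{n}\) is a quick sanity check that either implementation of the jackknife is correct.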

Gend=DT[,.(gender,ldl)]
Race=DT[,.(race,ldl)]
MG=Gend[,.(mean_ldl=mean(ldl),l_ldl=mean(ldl)+qnorm(0.025)*Mean_JK(ldl),
           r_ldl=mean(ldl)+qnorm(0.975)*Mean_JK(ldl)),by=gender]
MR=Race[,.(mean_ldl=mean(ldl),l_ldl=mean(ldl)+qnorm(0.025)*Mean_JK(ldl),
           r_ldl=mean(ldl)+qnorm(0.975)*Mean_JK(ldl)),by=race]

gend=ggplot(Gend,aes(x=gender,y=ldl))+geom_point(size=1,colour='blue')+
  labs(title='ldl~gender')
race=ggplot(Race,aes(x=race,y=ldl))+geom_point(size=1,colour='blue')+
  labs(title='ldl~race') 
mean_gend=ggplot(MG,aes(x=gender,y=mean_ldl))+geom_point(shape=16,col='red')+
  geom_segment(data=MG,mapping=aes(x=gender,xend=gender,y=l_ldl,yend=r_ldl),
               col='blue')+
  labs(title = 'Mean level of ldl between genders')
mean_race=ggplot(MR,aes(x=race,y=mean_ldl))+geom_point(shape=16,col='red')+
  geom_segment(data=MR,mapping=aes(x=race,xend=race,y=l_ldl,yend=r_ldl),
               col='blue')+
  labs(title = 'Mean level of ldl between races')

grid.arrange(gend,race,mean_gend,mean_race,nrow=2)

Here we show the residual plots and the QQ-plots; they suggest that the regression models satisfy the assumptions of the OLS model. We also plot the relationship between ldl and gender and race. The plots show that the ldl level of males is slightly higher than that of females overall, and the female values are more concentrated. Across races, the Other Race group has the highest overall ldl level, and Non-Hispanic Black the lowest.

Stata

I am interested in exploring the specific relationship between the level of triglyceride and the response variable \(\sqrt{ldl}\). In particular, does the relationship vary across conditional quantiles of \(\sqrt{ldl}\)? I used quantile regression to find the answer.

qreg ldl2 triglyceride, quantile(0.05)
qreg ldl2 triglyceride, quantile(0.25)
qreg ldl2 triglyceride, quantile(0.5)
qreg ldl2 triglyceride, quantile(0.75)
qreg ldl2 triglyceride, quantile(0.90)

Quantile Regression Results

Quantile Regression Plot

The quantile plot above shows that the level of triglyceride is positively correlated with \(\sqrt{ldl}\), and the relationship between the two does not vary much across quantiles. Also, note that the quantile plot is generated with R; the detailed script is located under the folder /Xiru Lyu/quantile.R.

Python

Additional analysis

# additional analysis
#quantile regression
mod=smf.quantreg('LDL~AGE',Data)
quantiles = np.arange(.05, .96, .1)
def fit_model(q):       # adapted from https://www.cnblogs.com/caiyishuai/p/11184166.html
    res = mod.fit(q=q)
    return [q, res.params['Intercept'], res.params['AGE']] + \
            res.conf_int().loc['AGE'].tolist()

models = [fit_model(x) for x in quantiles]
models = pd.DataFrame(models, columns=['q', 'a', 'b', 'lower_b', 'upper_b'])

ols = smf.ols('LDL ~ AGE', Data).fit()
ols_ci = ols.conf_int().loc['AGE'].tolist()
ols = dict(a = ols.params['Intercept'],
           b = ols.params['AGE'],
           lower_b = ols_ci[0],  # CI for b, lower bound
           upper_b = ols_ci[1])  # CI for b, upper bound

print(models)
##       q          a         b   lower_b   upper_b
## 0  0.05   7.365460  0.003383 -0.004561  0.011326
## 1  0.15   8.134052  0.008011  0.003168  0.012855
## 2  0.25   8.476658  0.013114  0.008268  0.017960
## 3  0.35   8.848354  0.016674  0.012116  0.021231
## 4  0.45   9.222907  0.017595  0.013548  0.021642
## 5  0.55   9.687246  0.016327  0.012471  0.020183
## 6  0.65   9.993639  0.019470  0.015664  0.023275
## 7  0.75  10.422161  0.020762  0.016958  0.024566
## 8  0.85  11.041352  0.020834  0.016755  0.024912
## 9  0.95  12.041026  0.022692  0.017059  0.028326
print(ols)
## {'a': 9.590565977646861, 'b': 0.014434702126817935, 'lower_b': 0.011225620121011554, 'upper_b': 0.017643784132624317}
x = np.arange(Data.AGE.min(), Data.AGE.max()+1, 1)
get_y = lambda a, b: a + b * x

fig, ax = plt.subplots(figsize=(8, 6))

for i in range(models.shape[0]):
    y = get_y(models.a[i], models.b[i])
    ax.plot(x, y, linestyle='dotted', color='g')
y = get_y(ols['a'], ols['b'])

ax.plot(x, y, color='red', label='OLS')
ax.scatter(Data.AGE, Data.LDL, alpha=.2)
ax.set_xlim((10, 85))
ax.set_ylim((3.0, 17.5))
legend = ax.legend()
ax.set_xlabel('AGE', fontsize=16)
ax.set_ylabel('LDL(sqrt_LDL)', fontsize=16);
Quantile reg: sqrt(ldl)~age

Discussion

All three tools above use a stepwise model selection technique to choose the best-fitting model, but there are some differences among them.

For the data.table-based analysis in R, model selection is performed with step(, direction='both'), and the selection criterion is AIC. In Stata, model selection is performed with stepwise, which selects based on p-values. As Stata cannot perform stepwise selection in both directions at once, separate forward and backward selections are run, and the p-value thresholds are tuned so that the selected model matches the one chosen by R. For Python, the code for backward and forward selection was downloaded from the given GitHub site, with AIC as the selection criterion. Because Python's selection procedure treats multi-level categorical variables differently, its final regression results differ slightly from those of R and Stata.
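As a sketch of the selection logic the three tools share, here is a minimal numpy-only forward selection by AIC on synthetic data (the helper names `aic` and `forward_select` and the toy data are ours; the project itself used step() in R, stepwise in Stata, and downloaded routines in Python):

```python
import numpy as np

def aic(y, X):
    """AIC of an OLS fit (Gaussian likelihood, up to an additive constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n, k = X.shape
    return n * np.log(rss / n) + 2 * k

def forward_select(y, candidates):
    """Greedily add the predictor that most lowers AIC; stop when none does."""
    n = len(y)
    chosen, X = [], np.ones((n, 1))          # start from the intercept-only model
    best = aic(y, X)
    improved = True
    while improved:
        improved = False
        for name, col in candidates.items():
            if name in chosen:
                continue
            score = aic(y, np.column_stack([X, col]))
            if score < best:                 # track the best single addition
                best, pick = score, name
                improved = True
        if improved:
            chosen.append(pick)
            X = np.column_stack([X, candidates[pick]])
    return chosen

rng = np.random.default_rng(0)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))
y = 2 * x1 - 3 * x2 + rng.normal(size=n)     # x3 is pure noise
sel = forward_select(y, {"x1": x1, "x2": x2, "x3": x3})
print(sel)
```

Backward selection works the same way in reverse (start full, drop the variable whose removal most lowers AIC), and p-value-based selection, as in Stata, simply swaps the scoring criterion.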

Results

Python

Note that for Python we use pandas to merge and clean all the datasets. Because there is no ready-made stepwise-regression package, we adapt a self-written forward-selection function based on adjusted R-squared to select variables, and fit the final model with sklearn.linear_model.LinearRegression. To stay consistent with my partners, the candidate variables are 'LBXTR' (triglyceride level), 'BPXSY' (systolic blood pressure), 'BPXDI' (diastolic blood pressure), 'FAT' (average fat intake), 'CHOL' (average cholesterol intake), 'GENDER', 'AGE', 'RACE', 'HEIGHT', 'WEIGHT', 'BMI', 'LBXTR2' (LBXTR squared) and 'AGE2' (AGE squared). The forward-selection function chose LBXTR, LBXTR2, BPXDI, AGE, AGE2, RACE, BPXSY, HEIGHT and CHOL, with an adjusted R-squared of 0.20377. The fitted LinearRegression model has R-squared 0.207806, which means these variables explain about 20% of the variation in cholesterol level. The corresponding coefficients are 4.37936363e-01, -9.94588617e-04, 1.12071396e-01, 2.09660709e+00, -2.20899498e-02, 9.69819194e-01, 8.32880449e-02, -1.07237225e-01 and -3.73454180e-03.

However, it is not straightforward in this Python workflow to declare a variable such as race as a factor. To handle this, I used a mixed model from the statsmodels module, which gives the coefficients: intercept 38.613, LBXTR 0.436, LBXTR2 -0.001, BPXDI 0.119, AGE 2.101, AGE2 -0.022, BPXSY 0.080, HEIGHT -0.108, CHOL -0.004, and Group Var 1.897.

From these results we note several points. First, gender has no effect on the cholesterol level (mg/dl). Second, height affects the level to some extent while weight does not: although intuition suggests that heavier people carry more total cholesterol, weight does not change the cholesterol concentration (mg/dl), which may also explain why fat intake has barely any impact here. Moreover, although we might expect higher intake to raise the corresponding blood level, the two-day average cholesterol intake is associated with a slightly lower blood cholesterol concentration; this may be because two days of records are not representative of long-term intake, and the body may regulate short-term responses to food much as it does blood glucose. In addition, the p-value for CHOL is the largest (0.258), which means its true coefficient could be zero. Third, BPXDI (diastolic blood pressure) has a larger impact than BPXSY (systolic blood pressure). Fourth, the cholesterol concentration increases by about 2 mg/dl per additional year of age, but the rate of increase slows down because the coefficient of AGE2 is negative.
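The fourth point can be made concrete: with a quadratic term, the fitted marginal effect of age is \(\beta_{AGE} + 2\beta_{AGE2}\cdot age\), which crosses zero at \(-\beta_{AGE}/(2\beta_{AGE2})\). A quick check using the mixed-model coefficients reported above (AGE 2.101, AGE2 -0.022); the curve's exact peak depends on the fitted values, so this is only a rough illustration:

```python
# coefficients taken from the mixed-model output above
b_age, b_age2 = 2.101, -0.022

# marginal effect of one additional year of age at a given age
marginal = lambda age: b_age + 2 * b_age2 * age

# age at which the fitted cholesterol-age curve peaks: -b_age / (2 * b_age2)
peak_age = -b_age / (2 * b_age2)
print(peak_age)
```

So under this fit, cholesterol rises with age up to roughly the late forties and declines thereafter, consistent with the negative AGE2 coefficient.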